The English-Slovene ACQUIS corpus
Abstract
The paper presents the SVEZ-IJS corpus, a large, annotated, parallel English-Slovene corpus of translated legal texts of the European Union, the ACQUIS Communautaire. The corpus contains approximately 2 x 5 million words and was compiled from the translation memory obtained from the Translation Unit of the Slovene Government Office for European Affairs. The corpus is encoded in XML according to the Text Encoding Initiative Guidelines (TEI P4); each translation memory unit contains useful metadata and the two aligned segments (sentences). Both the Slovene and the English text are linguistically annotated at the word level with context-disambiguated lemmas and morphosyntactic descriptions, which follow the MULTEXT guidelines. The complete corpus is freely available for research, either via an on-line concordancer or for download from the corpus home page at http://nl.ijs.si/svez/.
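As a rough illustration of the structure described above (a translation unit holding two aligned segments whose words carry a lemma and a morphosyntactic description), the following minimal sketch reads one such unit in Python. The element and attribute names (`tu`, `seg`, `w`, `lang`, `lemma`, `msd`), the sample sentence, and the tag values are assumptions made for illustration only; the actual corpus schema may differ and should be checked against the distributed files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample unit. Element names, attribute names, and tag values
# are illustrative assumptions, not the actual SVEZ-IJS schema.
SAMPLE = """
<tu id="tu.1">
  <seg lang="en">
    <w lemma="the" msd="Dd">The</w>
    <w lemma="council" msd="Ncns">Council</w>
    <w lemma="decide" msd="Vmis">decided</w>
  </seg>
  <seg lang="sl">
    <w lemma="svet" msd="Ncmsn">Svet</w>
    <w lemma="biti" msd="Va-r3s-n">je</w>
    <w lemma="odločiti" msd="Vmep-sm">odločil</w>
  </seg>
</tu>
"""


def read_unit(xml_text):
    """Return (unit id, {language: [(token, lemma, msd), ...]}) for one unit."""
    unit = ET.fromstring(xml_text)
    segments = {}
    for seg in unit.findall("seg"):
        # Each word element carries its surface form as text and its
        # annotation as attributes.
        segments[seg.get("lang")] = [
            (w.text, w.get("lemma"), w.get("msd")) for w in seg.findall("w")
        ]
    return unit.get("id"), segments


unit_id, segments = read_unit(SAMPLE)
for lang, tokens in segments.items():
    print(unit_id, lang, [lemma for _, lemma, _ in tokens])
```

Keeping both segments under one unit keyed by language means the aligned sentence pair can be retrieved together, which is the property a parallel corpus is usually queried for.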
Similar resources
Slovene-English Datasets for MT
Advances in machine translation are becoming increasingly dependent on the availability of large scale language resources, in particular parallel corpora. The talk presents Slovene-English language resources that were developed as datasets for translation studies and machine learning programs. Three parallel datasets are introduced: the MULTEXT-East multilingual word-annotated corpus, the IJS-E...
Quality Estimation for Synthetic Parallel Data Generation
This paper presents a novel approach for parallel data generation using machine translation and quality estimation. Our study focuses on pivot-based machine translation from English to Croatian through Slovene. We generate an English–Croatian version of the Europarl parallel corpus based on the English–Slovene Europarl corpus and the Apertium rule-based translation system for Slovene–Croatian. ...
Enriching Slovene WordNet with domain-specific terms
The paper describes an innovative approach to expanding the domain coverage of wordnet by exploiting multiple resources. In the experiment described here we are using a large monolingual Slovene corpus of texts from the domain of informatics to harvest terminology from, and a parallel English-Slovene corpus and an online dictionary as bilingual resources to facilitate the mapping of terms to th...
A Multilingual Approach to Building Slovene Wordnet
The paper presents an experiment in which synsets for Slovene wordnet were induced automatically from several multilingual resources. Our research is based on the assumption that translations are a plausible source of semantically relevant information. More specifically, we argue that the translational relation on the one hand reduces ambiguity of a source word and on the other conveys semantic...
A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages
We present Arab-Acquis, a large publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRCAcquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words. The corpus follows previous data splits in the ...